Introduction

This report investigates the performance of different aggregation methods for forecasting competition assessment, using the RCT-A dataset from the HFC competition. I evaluated five aggregation methods and propose an improvement based on the best-performing one.

The dataset was analysed using the data.table R package, which allows fast and memory-efficient handling of data.

Data Structure

The first-year competition data comes in three main datasets:

  • The rct-a-questions-answers.csv dataset contains metadata on the questions, such as dates, tags, and descriptions. The variables important to this assignment are the discover IDs for the questions and answers (used for joining the datasets) and the resolved probabilities for the answers (i.e. the encoding of the true outcome).

  • The rct-a-daily-forecasts.csv dataset contains the daily forecasts of each performer forecasting method, along with indexes that allow joining this dataset with the other crucial datasets. The variables important to this assignment are the date, the discover IDs for the questions and answers, the external prediction set ID (i.e. the ID common to a predictor that is assigning probabilities to a set of possible answers), and the forecast value itself.

  • The rct-a-prediction-sets.csv dataset contains information on prediction sets and basic question and answer metadata, including forecasted and final probability values, along with indexes that allow joining this dataset with the other datasets. This dataset appears to be redundant, as the important information can be found in the first two datasets.

Data Cleaning and Preprocessing

To reduce the size of the datasets, only the relevant columns of rct-a-questions-answers.csv and rct-a-daily-forecasts.csv were selected. These were:

From rct-a-daily-forecasts.csv:

  • date
  • discover question id
  • discover answer id
  • forecast
  • created at
  • external prediction set id

From rct-a-questions-answers.csv:

  • discover question id
  • discover answer id
  • answer resolved probability

The variables of interest were assessed for missing values, which were subsequently removed. Lastly, only the most recent prediction per predictor per day was retained (although the rct-a-daily-forecasts.csv dataset appears to already contain only a single prediction per predictor per day).
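The analysis itself was performed with R's data.table; purely as an illustrative sketch (in Python, with hypothetical column names standing in for the CSV headers), the deduplication step — keeping only the most recently created forecast per predictor, question, answer, and day — could look like:

```python
def latest_forecasts(rows):
    """Keep only the most recent forecast per (prediction set, question,
    answer, date), using the creation timestamp as the tie-breaker.

    rows: list of dicts with keys 'set_id', 'question_id', 'answer_id',
    'date', 'created_at', 'forecast' (illustrative names, not the
    actual CSV headers). Returns one row per key: the latest-created one.
    """
    latest = {}
    for row in rows:
        key = (row["set_id"], row["question_id"], row["answer_id"], row["date"])
        if key not in latest or row["created_at"] > latest[key]["created_at"]:
            latest[key] = row
    return list(latest.values())
```

In data.table the same effect is achieved by ordering on the creation timestamp and taking the last row per group.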

Aggregation Methods

I aggregated the individual forecasts for each question-day pair using five different methods:

  • Arithmetic Mean: A simple average of all forecasts.
    \[ \text{Arithmetic Mean}(x) = \frac{1}{n} \sum_{i=1}^{n} x_i \]
  • Median: The middle value, which is robust to outliers.
    \[ \text{Median}(x) = \begin{cases} x_{\frac{n+1}{2}} & \text{if } n \text{ is odd} \\ \frac{x_{\frac{n}{2}} + x_{\frac{n}{2} + 1}}{2} & \text{if } n \text{ is even} \end{cases} \]
  • Geometric Mean: A multiplicative average, reducing the influence of extreme forecasts.
    \[ \text{Geometric Mean}(x) = \exp\left(\frac{1}{n} \sum_{i=1}^{n} \log(x_i)\right) \]
    • In the case where any \(x_i = 0\), we add a small value \(\epsilon\) to avoid taking the logarithm of zero.
  • Trimmed Mean: The arithmetic mean after removing the top and bottom 10% of forecasts.
    \[ \text{Trimmed Mean}(x) = \frac{1}{n - 2k} \sum_{i=k+1}^{n-k} x_{(i)} \]
    • where \(k = \left\lfloor 0.1n \right\rfloor\) is the number of values removed from both the top and bottom of the sorted data.
  • Geometric Mean of Odds: Converts probabilities to odds before calculating the geometric mean.
    1. Convert probabilities \(p_i\) to odds: \[ \text{Odds}(p_i) = \frac{p_i}{1 - p_i} \]
    2. Compute the geometric mean of the odds: \[ \text{Geometric Mean of Odds}(p) = \exp\left(\frac{1}{n} \sum_{i=1}^{n} \log\left(\text{Odds}(p_i)\right)\right) \]
    3. Convert the result back to probabilities: \[ p = \frac{\text{Geometric Mean of Odds}}{1 + \text{Geometric Mean of Odds}} \]
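The five methods above can be sketched as follows (in Python for illustration — the analysis itself used R; the \(\epsilon\) guard and the 10% trim fraction follow the definitions above):

```python
import math
import statistics

EPS = 1e-6  # small value added before taking logs, to avoid log(0)

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    return statistics.median(xs)

def geometric_mean(xs):
    # exp of the mean log; EPS guards against zero forecasts
    return math.exp(sum(math.log(x + EPS) for x in xs) / len(xs))

def trimmed_mean(xs, trim=0.1):
    # Drop the top and bottom floor(trim * n) values before averaging.
    xs = sorted(xs)
    k = int(len(xs) * trim)
    kept = xs[k:len(xs) - k] if k > 0 else xs
    return sum(kept) / len(kept)

def geometric_mean_of_odds(ps):
    # Probabilities -> odds -> geometric mean -> back to a probability.
    odds = [(p + EPS) / (1 - p + EPS) for p in ps]
    gm = math.exp(sum(math.log(o) for o in odds) / len(odds))
    return gm / (1 + gm)
```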
The following table shows the aggregated data, using the five aggregation methods, per day, question, and possible answer:

Evaluation of Aggregation Methods

To evaluate the accuracy of each aggregation method, I computed the Brier score, which measures the mean squared error between the aggregated forecast and the actual outcome.

The Brier score measures how close the predicted probabilities are to the actual outcomes. It is defined as the mean squared error between the predicted probabilities \(\hat{p}_{ij}\) and the known outcomes \(y_{ij}\), given by the formula:

\[ \text{Brier Score} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{r} \left( y_{ij} - \hat{p}_{ij} \right)^2 \]

where:

  • \(y_{ij}\) is the actual outcome of answer \(j\) for prediction \(i\) (0 or 1)
  • \(\hat{p}_{ij}\) is the predicted probability of answer \(j\) for prediction \(i\)
  • \(r\) is the number of possible answers per question
  • \(n\) is the total number of predictions

In this multi-outcome form the Brier score ranges from 0 to 2, where low values indicate better predictive capabilities.
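As a sketch (Python for illustration), the multi-outcome Brier score can be computed directly from the definition, with one row per prediction and one column per possible answer:

```python
def brier_score(p_hat, y):
    """Multi-outcome Brier score.

    p_hat: list of rows of predicted probabilities, one column per answer.
    y:     list of rows of resolved outcomes (one-hot: 1 for the answer
           that occurred, 0 otherwise).
    """
    n = len(p_hat)
    total = 0.0
    for probs, outcomes in zip(p_hat, y):
        total += sum((o - p) ** 2 for p, o in zip(probs, outcomes))
    return total / n
```

A perfect forecast scores 0, while a maximally wrong forecast on a binary question scores 2, the upper bound of this formulation.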

Results

The following table shows the Brier scores for each question-day pair per aggregation method used. The final two columns, Best_Method and Ranked_Methods, show the best-performing method (i.e. the method with the lowest Brier score) and the order of method performance, respectively:


The following table shows the ordered summary of the method performance, along with the percentage of question-day pairs in which the method outperformed the rest:


The best-performing aggregation method was the geometric mean (best in 47.79% of prediction-day pairs (PDPs)), followed by the geometric mean of odds (31.42% of PDPs), the median (10.54% of PDPs), and the arithmetic mean (10.25% of PDPs). The trimmed arithmetic mean never outperformed the other methods. These results suggest that methods that discard information from extreme predictions (such as the median and the trimmed mean) fail to capture the full information in the aggregate prediction. The geometric mean and the geometric mean of odds appear to compete for the best prediction method, likely depending on the structure of the question and its possible answers. The nature of the question would therefore dictate which aggregation method most properly assesses the aggregate performance of the predictors.

Improvement on Aggregation Methods

I propose an improvement to the geometric mean of odds by extremising the odds to penalise under-confidence in forecasters.

The extremised geometric mean of odds is calculated in the following steps:

  1. Convert probabilities \(p_i\) to odds:

\[ \text{Odds}(p_i) = \frac{p_i}{1 - p_i} \]

  2. Compute the geometric mean of the odds:

\[ \text{Geometric Mean of Odds} = \exp\left(\frac{1}{n} \sum_{i=1}^{n} \log\left(\text{Odds}(p_i)\right)\right) \]

  3. Apply extremisation by raising the geometric mean of odds to the power of 2.5:

\[ \text{Extremised Odds} = \left( \text{Geometric Mean of Odds} \right)^{2.5} \]

  4. Convert the extremised odds back into probabilities:

\[ p_{\text{extremised}} = \frac{\text{Extremised Odds}}{1 + \text{Extremised Odds}} \]
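The four steps above can be sketched as follows (Python for illustration; the \(\epsilon\) guard against zero or unit probabilities is an assumption carried over from the geometric-mean method):

```python
import math

EPS = 1e-6   # guards against log(0) and division by zero at p = 0 or p = 1
A = 2.5      # extremisation exponent used in the report

def extremised_geo_mean_odds(ps, a=A):
    # 1. probabilities -> odds
    odds = [(p + EPS) / (1 - p + EPS) for p in ps]
    # 2. geometric mean of the odds
    gm = math.exp(sum(math.log(o) for o in odds) / len(odds))
    # 3. extremise by raising to the power a
    extremised = gm ** a
    # 4. odds -> probability
    return extremised / (1 + extremised)
```

With \(a > 1\) the aggregate is pushed away from 0.5 (e.g. a consensus of 0.6 maps to roughly 0.73), which is exactly the correction for under-confident forecasters; with \(a = 1\) the method reduces to the plain geometric mean of odds.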

Prior to assessing the improved method, the rct-a-daily-forecasts.csv dataset was filtered to include only data from the first day of the competition.

The following table shows the aggregated data, using the six aggregation methods (including the improved method), on day 1, per question and possible answer:

The following table shows the Brier scores for each question-day pair per aggregation method used. The final two columns, Best_Method and Ranked_Methods, show the best-performing method (i.e. the method with the lowest Brier score) and the order of method performance, respectively:

The following table shows the ordered summary of the method performance, along with the percentage of question-day pairs in which the method outperformed the rest:


The best-performing aggregation method was the extremised geometric mean of odds (42.86% of PDPs), followed by the arithmetic mean (28.57% of PDPs), the median (19.05% of PDPs), the geometric mean (4.76% of PDPs), and the geometric mean of odds (4.76% of PDPs). The trimmed arithmetic mean never outperformed the other methods. Evidently, the extremised geometric mean of odds outperformed the other methods and was thus a clear improvement in prediction evaluation. Its working principle is a modification of the geometric mean of odds, in which the geometric mean of odds is raised to the power of an extremising parameter, here equal to 2.5. This method corrects for forecaster under-confidence. In the present dataset it outcompeted the other methods; however, on a different dataset containing less forecaster under-confidence it would likely be non-optimal.

Conclusion

The extremised geometric mean of odds provided the best aggregation performance, suggesting that penalising under-confident predictions can improve forecasting accuracy. However, the effectiveness of this method may vary depending on the dataset’s structure and the forecasters’ behaviour.